The K-means clustering method partitions the observations into \(K\) clusters such that the total within-cluster variation, summed over all clusters, is minimized (\(K\) must be predefined). The within-cluster variation is defined as the sum of squared Euclidean distances between each observation and the centroid of its assigned cluster. We therefore want to solve: \[ \min_{C_1,\dots,C_K} \left\{ \sum_{k=1}^K \sum_{x_i \in C_k} \|x_i - \mu_k\|^2 \right\} \] where \(C_k\) is cluster \(k\), \(\mu_k\) is the centroid of that cluster, and \(x_i\) is observation \(i\).
The K-means algorithm starts by assigning each observation to a random cluster, giving an initial cluster assignment for all observations. It then iterates over the following two steps until the cluster assignments stop changing: 1) For each cluster, compute the cluster centroid. 2) Reassign each observation to the cluster whose centroid is closest.
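As a small sketch, the two-step iteration above is what base R's `kmeans()` carries out; the toy data here (two simulated Gaussian blobs) is made up purely for illustration:

```r
# Minimal K-means sketch on simulated toy data using base R's kmeans().
set.seed(1)
x <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),  # blob around (0, 0)
           matrix(rnorm(50, mean = 5), ncol = 2))  # blob around (5, 5)
km <- kmeans(x, centers = 2, nstart = 20)  # nstart = 20 random restarts
km$cluster       # cluster assignment for each observation
km$tot.withinss  # the total within-cluster sum of squares being minimized
```

Because the result depends on the random initial assignment, `nstart` reruns the algorithm from several random starts and keeps the solution with the lowest total within-cluster sum of squares.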
Hierarchical clustering, unlike K-means, does not need a predefined number of clusters \(K\). Each observation is first assigned to its own cluster, giving \(N\) clusters. All pairwise dissimilarities between clusters (usually Euclidean distances) are then measured, and the two most similar clusters are fused into a single cluster (giving \(N-1\) clusters). The dissimilarity between the two fused clusters determines the height of the fusion in the dendrogram (explained below). The pairwise inter-cluster dissimilarities are then recomputed and the most similar clusters fused again; this continues until all observations are in a single cluster.
Once clusters consist of multiple observations, we need a "linkage" to define the pairwise dissimilarity between clusters. For the three most common linkages, all dissimilarities between the observations in the two clusters are computed and then: Complete (maximal intercluster dissimilarity): the largest of these is recorded. Single (minimal intercluster dissimilarity): the smallest of these is recorded. Average (mean intercluster dissimilarity): the average of these is recorded.
Centroid: the dissimilarity between the centroids (means) of the two clusters. This can result in undesirable inversions (a merge/fusion line in the dendrogram lying lower than the previous one).
Hierarchical clustering can be interpreted with a dendrogram, which shows the fusions of clusters throughout the process; the dissimilarity between two fused clusters is shown by the height of the fusion. The dendrogram helps evaluate a sensible number of clusters to divide the observations into: as the dissimilarity between fused clusters becomes larger and larger, we look for a decent cutoff height, which then indicates a sensible number of clusters if this is not predefined.
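A minimal illustration of the agglomerative procedure and the dendrogram cutoff, assuming simulated toy data and complete linkage:

```r
# Hierarchical clustering sketch on simulated toy data.
set.seed(1)
x <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
           matrix(rnorm(50, mean = 5), ncol = 2))
d  <- dist(x)                         # pairwise Euclidean dissimilarities
hc <- hclust(d, method = "complete")  # complete (maximal intercluster) linkage
plot(hc)                     # dendrogram: fusion height = dissimilarity at merge
groups <- cutree(hc, k = 2)  # cut the tree at a height giving 2 clusters
```

`cutree()` can be given either `k` (a number of clusters) or `h` (a cutoff height), which mirrors the choice of cutoff discussed above.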
The ARI (Adjusted Rand Index) is a measure of similarity between two data clusterings or partitions. We use it to compare our clustering results to the original cluster assignments. The ARI is 0 when the similarity equals what random assignment would give, so a positive ARI is better than random assignment, and the closer it is to 1, the better the grouping fits. If it is negative, the grouping is less similar than random assignment. The ARI is calculated as: \[ ARI = \frac{\text{RI} - \mathbb{E}[\text{RI}]}{\max(\text{RI}) - \mathbb{E}[\text{RI}]} \] where the Rand Index (RI) is defined as: \[ RI = \frac{\text{correct similar pairs} + \text{correct dissimilar pairs}}{\text{total possible pairs}} = \frac{a + b}{\binom{n}{2}} \]
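The adjustment above can be computed directly from the contingency table of the two labelings; this base-R sketch is equivalent in spirit to the packaged `adj.rand.index()` used below (the function name `ari` is our own):

```r
# Adjusted Rand Index from the contingency table of two labelings.
ari <- function(a, b) {
  tab <- table(a, b)
  n <- sum(tab)
  sum_pairs <- function(x) sum(choose(x, 2))
  idx      <- sum_pairs(tab)          # pairs placed together in both labelings
  e1       <- sum_pairs(rowSums(tab)) # pairs together in labeling a
  e2       <- sum_pairs(colSums(tab)) # pairs together in labeling b
  expected <- e1 * e2 / choose(n, 2)  # expected index under random assignment
  max_idx  <- (e1 + e2) / 2           # maximum attainable index
  (idx - expected) / (max_idx - expected)
}
ari(c(1, 1, 2, 2), c(1, 1, 2, 2))  # identical partitions give ARI = 1
```

Note that the ARI only compares the partition structure, so relabeling the clusters (e.g. swapping labels 1 and 2 in one of the inputs) does not change the result.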
setwd("~/sc_covid_PiB2023/data_science_project/Data") # path to data folder
dat <- read_h5ad("Pilot_2_rule_them_ALL.h5ad")
The most evident structure seen in the low-dimensional embeddings is cell type clustering, whereas viral load and infection status at the cell level seem to cluster within the cell type groupings. Hence we use 8 clusters (k = 8), corresponding to the eight cell types.
For each dimension reduction method (PCA, UMAP, tSNE, scVI) we run k-means and hierarchical clustering and compute the ARI against the cell types. Each clustering result and ARI value is stored in an AnnData object per dimension reduction method.
n = 8 # number of clusters
m = length(dat$obsm_keys())
objects = list()
for (idx in 1:m){
# print(dat$obsm_keys()[idx])
# create anndata object to store clusters and metadata
coords <- as.matrix(dat$obsm[idx]) # extract embedding coordinates
x <- do.call(rbind, coords)[,1:2] # only use first two dimensions
colnames(x) <- c("col1", "col2") # rename columns
ad <- AnnData(X = x, obs = dat$obs, obsm = list(coordinates = x)) # convert to anndata object
# compute kmeans
k = kmeans(ad$X, n)
ad$obsm$kmeans <- data.matrix(as.factor(k$cluster))
# compute ARI
ARI = adj.rand.index(ad$obsm$kmeans, dat$obs$cellType)
ad$uns$kmeans_ARI <- ARI
# compute hierarchical clustering
dm <- dist(ad$X) # Distance matrix
hc <- hclust(dm, method="centroid") # centroid linkage (can produce inversions)
ad$obsm$hierarchical <- data.matrix(as.factor(cutree(hc, n))) # data, number of clusters
# compute ARI
ARI = adj.rand.index(ad$obsm$hierarchical, dat$obs$cellType)
ad$uns$hierarchical_ARI <- ARI
# save as unique anndata object
ad_name <- paste("ad",dat$obsm_keys()[idx] , sep = "_") # name anndata object
assign(ad_name, ad$copy())
objects = c(objects, ad_name)
}
ad_X_pca # show anndata and metadata
## AnnData object with n_obs × n_vars = 5694 × 2
## obs: 'cellType', 'viralLoad', 'PatientID', 'sampleID', 'Is_infected', 'Age', 'Sex', 'CoVID-19 severity'
## uns: 'kmeans_ARI', 'hierarchical_ARI'
## obsm: 'coordinates', 'kmeans', 'hierarchical'
For each dimension reduction method we generate a plot of the k-means and hierarchical clustering results, as well as one of the true cell type groups.
theme = theme(legend.position = "none",
axis.title = element_blank(),
axis.ticks = element_blank(),
axis.text = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
methods = c("UMAP", "tSNE", "PCA", "scVI", "scVI, UMAP", "scVI (batch corrected)", "scVI, UMAP (batch corrected)", "scVI, tSNE (batch corrected)", "scVI, tSNE")
for (jdx in 1:m) {
name <- unlist(objects[jdx])
ad <- get(name)
# create kmeans cluster plot
title = paste(c("Kmeans", methods[jdx]), collapse = " ") # setup title
caption = paste(c("ARI", round(ad$uns$kmeans_ARI, 4)), collapse = " ") # setup caption
p <- as.data.frame(ad$obsm) %>%
ggplot(aes(x = coordinates.1, y = coordinates.2, color = kmeans)) +
geom_point() +
theme +
labs(title = title, caption = caption)
# save plot as variable
p_name <- paste("kmeans", name, sep = "_") # create name for plot
assign(p_name, p) # assign name to variable
# create hierarchical cluster plot
title = paste(c("Hierarchical", methods[jdx]), collapse = " ") # setup title
caption = paste(c("ARI", round(ad$uns$hierarchical_ARI, 4)), collapse = " ") # setup caption
p <- as.data.frame(ad$obsm) %>%
ggplot(aes(x = coordinates.1, y = coordinates.2, color = hierarchical)) +
geom_point() +
theme +
labs(title = title, caption = caption)
# save plot as variable
p_name <-
paste("hierarchical", name, sep = "_") # create name for plot
assign(p_name, p) # assign name to variable
# create cell type plot
title = paste(c("Cell Type", methods[jdx]), collapse = " ") # setup title
p <- as.data.frame(ad$obsm) %>%
ggplot(aes(x = coordinates.1, y = coordinates.2, color = ad$obs$cellType )) +
geom_point() +
theme +
labs(title = title)
# save plot as variable
p_name <- paste("cellType", name, sep = "_") # create name for plot
assign(p_name, p) # assign name to variable
}
(kmeans_ad_X_pca | kmeans_ad_X_PCA_tSNE | kmeans_ad_X_PCA_UMAP | kmeans_ad_X_scVI_UMAP) /
(hierarchical_ad_X_pca | hierarchical_ad_X_PCA_tSNE | hierarchical_ad_X_PCA_UMAP | hierarchical_ad_X_scVI_UMAP) /
(cellType_ad_X_pca | cellType_ad_X_PCA_tSNE | cellType_ad_X_PCA_UMAP | cellType_ad_X_scVI_UMAP)
For the scVI model we visualize the distribution of PatientID in the low-dimensional embedding to see how strongly patient attributes drive the clustering, and how the distribution looks after batch-correcting for patient.
for (jdx in 1:m) {
name <- unlist(objects[jdx])
ad <- get(name)
# create patient ID plot
title = paste(c("PatientID", methods[jdx]), collapse = " ") # setup title
p <- as.data.frame(ad$obsm) %>%
ggplot(aes(x = coordinates.1, y = coordinates.2, color = ad$obs$PatientID )) +
geom_point() +
theme +
labs(title = title)
# save plot as variable
p_name <- paste("PatientID", name, sep = "_") # create name for plot
assign(p_name, p) # assign name to variable
}
Each color represents a different patient.
(PatientID_ad_X_scVI_tSNE | PatientID_ad_X_scVI_UMAP) /
(PatientID_ad_X_scVI_corrected_tSNE | PatientID_ad_X_scVI_corrected_UMAP)